Abstract. We show that the neighbor-joining algorithm is a robust quartet method
for constructing trees from distances. This leads to a new performance guarantee that
contains Atteson's optimal radius bound as a special case and explains many cases
where neighbor-joining is successful even when Atteson's criterion is not satisfied. We
also provide a proof for Atteson's conjecture on the optimal edge radius of the
neighbor-joining algorithm. The strong performance guarantees we provide also hold for the
quadratic time fast neighbor-joining algorithm, thus providing a theoretical basis for
inferring very large phylogenies with neighbor-joining.

1. Introduction

The widely used neighbor-joining algorithm [24] has been extensively analyzed and
compared to other tree construction methods. Previous studies have mostly focused
on empirical testing of neighbor-joining. Examples include the comparison of
neighbor-joining with quartet [15] and maximum likelihood [11] methods, comprehensive
comparisons of multiple programs [13], [16], and detailed testing of the limits of the
neighbor-joining algorithm [17]. These studies have concluded that neighbor-joining is effective
for many problems, and have recommended the algorithm. For example, in [15] it is
remarked that "quartet-based methods are much less accurate than the simple and
efficient method of neighbor-joining". In a recent study, Tamura et al. [29] conclude that
there are "bright prospects for the application of the NJ and related methods in inferring
large phylogenies."

Furthermore, new methods are now almost always compared with neighbor-joining

to establish an improvement in performance [21 [H [TTl [191 (201 [231 [21] • In other words, 
neighbor-joining has become the standard by which new phylogenetic algorithms are 
compared, and continues to surface as an effective candidate method for constructing 
large phylogenies. This is remarkable, considering the simplicity of the neighbor-joining 
algorithm. We begin with a review of the neighbor-joining algorithm and a very brief 
description of basic concepts involved. The definitions and notation are based on [26].
The reader unfamiliar with the concepts described below should consult this reference 
for full details. 

Throughout the text a phylogenetic X-tree T denotes a binary tree describing the
evolutionary relationships between a set of taxa X, which label the leaves of the tree.
We will also use the term phylogenetic tree when the set X is clear from the context. A
cherry of T is a pair of leaves (or taxa) $\{i,j\} \in \binom{X}{2}$ such that the path from i to j in T
has length exactly 2 (i.e., i and j have a common "parent" in T).

A split $A|B$ of X is a bipartition $X = A \cup B$ with $A \cap B = \emptyset$. In general we will use
splits of X induced by removing an internal edge of T (which results in two
disconnected trees, one with leaf set A and the other with leaf set B).

A dissimilarity map on X is a map $\delta : X \times X \to \mathbb{R}$ satisfying $\delta(x,y) = \delta(y,x)$,
$\delta(x,y) \ge 0$ and $\delta(x,x) = 0$ for all $x,y \in X$. A tree metric, or an additive
dissimilarity map, is a dissimilarity map $\delta$ for which there exists a tree T with edge lengths
$l : E(T) \to (0,\infty)$ such that $\delta(i,j) = \sum_{e \in P(i,j)} l(e)$, where $P(i,j)$ is the set of edges in
E(T) on the path from i to j in T. We note that for an additive dissimilarity map, the
tree topology and edge lengths l are uniquely defined.

We are now ready to give the full description of neighbor-joining: 

(1) Given a set of taxa X with $n = |X|$ and a dissimilarity map $\delta : X \times X \to \mathbb{R}$ (this is a map
that satisfies $\delta(i,j) = \delta(j,i)$ and $\delta(i,i) = 0$), compute the Q-criterion for $\delta$:

$$Q_\delta(i,j) = (n-2)\,\delta(i,j) - \sum_{k \neq i} \delta(i,k) - \sum_{k \neq j} \delta(j,k).$$

Then select a pair a, b that minimizes $Q_\delta$, as motivated by the following theorem:

Theorem 1 (Saitou-Nei [24] and Studier-Keppler [28]). Let $\delta_T$ be the tree metric
corresponding to the tree T. The pair a, b that minimizes $Q_{\delta_T}(i,j)$ is a cherry in
the tree.

(2) If there are more than three taxa, replace the putative cherry a and b with a leaf
$j_{ab}$, and construct a new dissimilarity map where $\delta(i,j_{ab}) = \frac{1}{2}(\delta(i,a) + \delta(i,b))$.
This is called the reduction step.

(3) Repeat until there are three taxa. 
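
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it assumes the standard form of the Q-criterion, $Q_\delta(i,j) = (n-2)\delta(i,j) - \sum_k \delta(i,k) - \sum_k \delta(j,k)$, stores the dissimilarity map in a dictionary keyed by unordered pairs, and uses the hypothetical names `q_criterion` and `neighbor_joining`. It records only the sequence of agglomerated pairs, which determines the output topology.

```python
from itertools import combinations


def q_criterion(d, taxa):
    """Return {(i, j): Q(i, j)} for all unordered pairs of taxa."""
    n = len(taxa)
    # Row sums: total dissimilarity from each taxon to all the others.
    row = {i: sum(d[frozenset((i, k))] for k in taxa if k != i) for i in taxa}
    return {
        (i, j): (n - 2) * d[frozenset((i, j))] - row[i] - row[j]
        for i, j in combinations(taxa, 2)
    }


def neighbor_joining(d, taxa):
    """Return the pairs joined at each step, in order."""
    d, taxa, joins = dict(d), list(taxa), []
    while len(taxa) > 3:
        q = q_criterion(d, taxa)
        a, b = min(q, key=q.get)      # step (1): pair minimizing Q
        new = (a, b)                  # label of the merged leaf
        for i in taxa:
            if i not in (a, b):       # step (2): reduction by averaging
                d[frozenset((i, new))] = (
                    d[frozenset((i, a))] + d[frozenset((i, b))]
                ) / 2
        taxa = [i for i in taxa if i not in (a, b)] + [new]
        joins.append(new)
    return joins


# Tree metric of a 5-leaf tree with cherries {a, b} and {d, e}, leaf c in the
# middle, and all seven edges of length 1 (an illustrative input).
dist = {"ab": 2, "ac": 3, "ad": 4, "ae": 4, "bc": 3,
        "bd": 4, "be": 4, "cd": 3, "ce": 3, "de": 2}
d = {frozenset(k): v for k, v in dist.items()}
print(neighbor_joining(d, "abcde"))
```

On a tree metric the first pair selected is a cherry of the tree, as Theorem 1 states; ties in the Q-criterion are broken here by iteration order, which is an arbitrary choice of this sketch.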

Although the Q-criterion is easy to compute, the formula seems, at first glance,
somewhat contrived and mysterious. However, the formulation of the Q-criterion is not
accidental and has many useful properties. For example, it is linear in the distances, is
permutation equivariant (the input order of the taxa does not matter), and it is consistent,
i.e., it correctly finds the tree corresponding to a tree metric. Bryant [3] has shown
that the Q-criterion is in fact the unique selection criterion satisfying these properties.
Gascuel and Steel [12], in an excellent review, provide a very precise mathematical
answer to "what does neighbor-joining do?", via a proof that neighbor-joining is a greedy
algorithm which "decreases the whole tree length" as computed by Pauplin's formula 

mm- 

These results motivate the neighbor-joining algorithm, but they do not offer any 
insight into its performance with dissimilarity maps that are not tree metrics. More im- 
portantly, they do not address the central question of the behavior of neighbor-joining 
on dissimilarity maps that arise from maximum likelihood estimation of distances be- 
tween sequences in multiple alignments. There is one result that addresses precisely 
these issues: 

Theorem 2 (Atteson [1]). Neighbor-joining has $l_\infty$ radius $\frac{1}{2}$.

This means that if the distance estimates are less than half the minimal edge length
of the tree away from their true value then neighbor-joining will reconstruct the correct
tree. Atteson's theorem shows that neighbor-joining is statistically consistent. Infor- 
mally, this means that neighbor-joining reconstructs the correct tree from dissimilarity 
maps estimated from sufficiently long multiple alignments. This has been a widely used 
justification for the observed success of neighbor-joining, and is regarded as the definitive 
explanation for "when does neighbor-joining work?" [12]. 

However, as noted in [18], Atteson's condition frequently fails to be satisfied even 
when neighbor-joining is successful. This is also remarked on in [6]: "In practice, most 
distances are far from being nearly additive [satisfying Atteson's condition]. Thus, al- 
though important, optimal reconstruction radius is not sufficient for an algorithm to 
be useful in practice." Our main result is an explanation of why neighbor-joining is 
useful in practice. We obtain our results using a new consistency theorem (Theorem 16
in Section 4). Roughly speaking, our theorem states that neighbor-joining is successful
(globally) when it works correctly (locally) for the quartets in the tree. Thus, Theorem
16 provides a crucial link between neighbor-joining and quartet methods, a connection
that is first developed in Section 3. We also show that Atteson's theorem is a special 
case of our theorem. 

In Section 5 we present a proof of Atteson's conjecture on the optimal edge radius
of neighbor-joining. For a dissimilarity map $\delta$ whose $l_\infty$ distance to a tree metric $\delta_T$ is
less than $\frac{\epsilon}{4}$, we prove that the output T' of neighbor-joining applied to $\delta$ will contain
all edges of T of length at least $\epsilon$, i.e., neighbor-joining has $l_\infty$ edge radius $\frac{1}{4}$. We say
that T' contains the edge e if there exists an edge $e' \in T'$ such that the split of the
taxa induced by removing e' from T' is the same as that induced by removing e from T.
This result is tight, as [1] provides a counterexample for edge radius larger than $\frac{1}{4}$. The
methods we employ for this result are virtually the same as the ones used in the proof of 
Theorem 16, although the proof is non-constructive and based on an averaging argument. 
We conclude with a simulation study that establishes the practical significance of our 
theorems in Section 6. 



2. The shifting lemma 

In this section we provide a few preliminary observations that are necessary for our 
results. We begin by reviewing a lemma from pH], namely that there is an alternative 
to the reduction step (2) in the neighbor-joining algorithm. Condition (2'): if there are
more than three taxa, replace the putative cherry a and b with a leaf $j_{ab}$, and construct
a new dissimilarity map where $\delta(i,j_{ab}) = \frac{1}{2}(\delta(i,a) + \delta(i,b) - \delta(a,b))$.

Notice that the only difference between (2) and (2') is the addition of the $-\delta(a,b)$
term. On a tree metric, the first version of the algorithm corresponds to replacing a 
cherry by a single node whose distance to the rest of the tree is the average of the dis- 
tances of the two collapsed nodes. In the second version, the reduction step is equivalent 
with replacing the cherry by its "root", i.e., the point where the path between the cherry 
nodes connects to the rest of the tree. 

Definition 3. We say the dissimilarity map $\delta'$ is a shift of $\delta$ if and only if there exists
a fixed $\epsilon$ and a distinguished taxon a, such that $\delta'(a,x) = \delta(a,x) + \epsilon$ for all $x \neq a$ and
$\delta'(x,y) = \delta(x,y)$ for all $x,y \neq a$.

Lemma 4 (The shifting lemma [10]). Shifting does not affect the outcome of
neighbor-joining.

Proof: This follows from the observation that shifting by $\epsilon$ around a changes the
value of $Q_\delta(x,y)$ by exactly $-2\epsilon$, for all pairs x, y. Moreover, the result of collapsing taxa
x, y in the shifted dissimilarity map is the same as the result of collapsing the same taxa
in the initial dissimilarity map, or an $\epsilon$ shift of it. By these two observations, at each
step neighbor-joining will collapse the same pairs as before the shift. □
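
The first observation in the proof can be checked numerically. The sketch below assumes the standard Q-criterion $Q_\delta(i,j) = (n-2)\delta(i,j) - \sum_k \delta(i,k) - \sum_k \delta(j,k)$ and uses an arbitrary integer dissimilarity map chosen only for illustration; the helper names `q_criterion` and `shift` are hypothetical.

```python
from itertools import combinations


def q_criterion(d, taxa):
    """Return {frozenset({i, j}): Q(i, j)} for all unordered pairs."""
    n = len(taxa)
    row = {i: sum(d[frozenset((i, k))] for k in taxa if k != i) for i in taxa}
    return {
        frozenset((i, j)): (n - 2) * d[frozenset((i, j))] - row[i] - row[j]
        for i, j in combinations(taxa, 2)
    }


def shift(d, a, eps):
    """Add eps to every dissimilarity involving taxon a (Definition 3)."""
    return {pair: v + eps if a in pair else v for pair, v in d.items()}


taxa = "abcde"
# An arbitrary symmetric integer dissimilarity map (not a tree metric).
values = [5, 9, 4, 7, 6, 8, 3, 2, 10, 1]
d = {frozenset(p): v for p, v in zip(combinations(taxa, 2), values)}

eps = 3
before = q_criterion(d, taxa)
after = q_criterion(shift(d, "c", eps), taxa)

# Every Q value drops by exactly 2 * eps, so the minimizing pair is unchanged.
assert all(after[p] == before[p] - 2 * eps for p in before)
assert min(before, key=before.get) == min(after, key=after.get)
```

Integer inputs keep the comparison exact; with floating-point distances the same check would need a tolerance.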

Corollary 5. The two versions of neighbor-joining are equivalent: they produce trees
with the same topology.

Proof: Collapsing x, y by the second reduction method gives a dissimilarity map that
is a $-\frac{\delta(x,y)}{2}$-shift of the one produced by collapsing by the first type of reduction step. □

It is important to notice that the operation of shifting a real tree metric $\delta_T$ by $\epsilon$ around
taxon a corresponds exactly to modifying the length of the leaf edge of T corresponding
to a by exactly $\epsilon$. So in effect we can allow negative leaf edges since shifting around
all leaves by a large enough constant will make them positive, while the outcome of the 
algorithm is the same. Of course, the statement of our edge radius results does not make 
sense in the case of negative edge lengths. However, the only negative edges are the leaf 
edges, and in that case the statements we make are vacuous. They hold trivially since 
neighbor-joining reconstructs the bipartition consisting of one taxon vs. the rest of the 
taxa set correctly, regardless of the input. 

In the remainder of the paper, by a tree metric we will mean a shift of a tree metric, 
i.e. a metric corresponding to a tree where leaf edges are allowed to be negative. 

3. Quartets and neighbor-joining

We now show that for four taxa, neighbor-joining is equivalent to the four point method
[7]. We will use the notation $(ij : kl)$ to denote the tree topology on the set $\{i,j,k,l\}$
where the pairs (i, j) and (k, l) form cherries separated by a middle edge. When i, j, k, l
are nodes (internal or external) of a tree T, we will say that $(ij : kl)$ is a quartet of T if
the topology induced by T on the four nodes is that indicated by $(ij : kl)$.

Proposition 6. Let $X = \{i,j,k,l\}$ and $\delta : X \times X \to \mathbb{R}$ be a dissimilarity map.
The neighbor-joining algorithm will return the tree $(ij : kl)$ where $\delta(i,j) + \delta(k,l) <
\min(\delta(i,k) + \delta(j,l), \delta(i,l) + \delta(j,k))$.
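
The selection rule in Proposition 6 is the four point method, and it can be sketched as a small Python function. The function name `four_point` and the dictionary representation of the dissimilarity map are assumptions made for this illustration only; ties are resolved by list order, which the proposition does not address.

```python
def four_point(d, i, j, k, l):
    """Return the pairing {p, q} | {r, s} minimizing d(p, q) + d(r, s)."""
    def dist(x, y):
        return d[frozenset((x, y))]

    pairings = [((i, j), (k, l)), ((i, k), (j, l)), ((i, l), (j, k))]
    return min(pairings, key=lambda s: dist(*s[0]) + dist(*s[1]))


# Tree metric of the quartet (ij : kl) with leaf edges 1, 2, 3, 4 for
# i, j, k, l and an internal edge of length 5 (illustrative values).
d = {frozenset(p): v for p, v in {
    "ij": 3, "kl": 7, "ik": 9, "il": 10, "jk": 10, "jl": 11}.items()}
print(four_point(d, "i", "j", "k", "l"))
```

For this input the sums are 10 for the pairing {i, j} | {k, l} and 20 for each of the other two, so the true quartet is selected.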

This result can be easily derived using the Q-criterion, but we prefer to motivate it 
using an alternative formulation of the neighbor-joining criterion formulated in pU] . 
For a dissimilarity map $\delta$, let

$$w_\delta(ij : kl) = \frac{1}{2}\left(\delta(i,k) + \delta(i,l) + \delta(j,k) + \delta(j,l)\right) - \delta(i,j) - \delta(k,l).$$

Note that for a quartet $(ij : kl)$ in a tree T with corresponding tree metric $\delta_T$,
$w_{\delta_T}(ij : kl)$ is double the length of the internal edge in the quartet. In [10] this is called a
"neighborliness" measurement.

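Theorem 7. If $\delta_T$ is the tree metric corresponding to a phylogenetic X-tree T and

$$Z_\delta(i,j) = \binom{n-2}{2}^{-1} \sum_{\{k,l\} \in \binom{X - \{i,j\}}{2}} w_\delta(ij : kl),$$

then the pair a, b that maximizes $Z_{\delta_T}(i,j)$ is a cherry in the tree.

Proof: Let $n = |X|$. The sum in $Z_{\delta_T}$ is over unordered pairs of $X - \{i,j\}$. Observe
that for any $\delta$,

$$Z_\delta(i,j) = -L_\delta(T) - \frac{n-1}{2\binom{n-2}{2}}\, Q_\delta(i,j),$$

where $L_\delta(T) = \binom{n-2}{2}^{-1} \sum_{\{x,y\} \in \binom{X}{2}} \delta(x,y)$ does not depend on i or j. Since the
coefficient of $Q_\delta(i,j)$ is negative, maximizing $Z_\delta$ is the same as minimizing $Q_\delta$, and the
theorem now follows directly from Theorem 1. □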
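
The linear relation between the two criteria can be verified numerically. The sketch below works with the unnormalized sum $\sum_{\{k,l\}} w_\delta(ij : kl)$ and checks the identity $\sum_{\{k,l\}} w_\delta(ij : kl) = -\sum_{\{x,y\}} \delta(x,y) - \frac{n-1}{2} Q_\delta(i,j)$, which is our own expansion of the definitions and assumes the standard Q-criterion; any normalizing constant on the Z-criterion rescales both sides equally. The function names and the test values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def w(d, i, j, k, l):
    """Neighborliness w(ij : kl) of the pairing {i, j} | {k, l}."""
    g = lambda x, y: d[frozenset((x, y))]
    return Fraction(g(i, k) + g(i, l) + g(j, k) + g(j, l), 2) - g(i, j) - g(k, l)


def z_sum(d, taxa, i, j):
    """Sum of w(ij : kl) over unordered pairs {k, l} of X - {i, j}."""
    rest = [t for t in taxa if t not in (i, j)]
    return sum(w(d, i, j, k, l) for k, l in combinations(rest, 2))


def q(d, taxa, i, j):
    """Q-criterion (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(taxa)
    row = lambda x: sum(d[frozenset((x, t))] for t in taxa if t != x)
    return (n - 2) * d[frozenset((i, j))] - row(i) - row(j)


taxa = "abcdef"
# Arbitrary integer dissimilarities for the 15 pairs (illustrative only).
values = [5, 9, 4, 7, 6, 8, 3, 2, 10, 1, 12, 11, 13, 15, 14]
d = {frozenset(p): v for p, v in zip(combinations(taxa, 2), values)}

n, total = len(taxa), sum(d.values())
for i, j in combinations(taxa, 2):
    assert z_sum(d, taxa, i, j) == -total - Fraction(n - 1, 2) * q(d, taxa, i, j)
```

Exact rational arithmetic via `fractions.Fraction` avoids rounding in the factor of one half.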
Although the naive computation of the Z-criterion requires quadratic time, the equiv- 
alence with the Q-criterion shows that each entry in the Z- matrix is just a sum of a 
linear number of distances. One may therefore wonder why the Z-criterion is worth 
mentioning at all. We outline a number of reasons why the Z-criterion may be a more 
natural way to formulate the neighbor-joining selection criterion. For example, note that
in the case of four taxa i, j, k, l, the Z-criterion is just $Z_\delta(i,j) = w_\delta(ij : kl)$ and
Proposition 6 follows immediately from Theorem 7. The Z-criterion also highlights the fact
that for a tree metric, the neighbor-joining selection criterion does not depend on the
length of edges adjacent to leaves. This is remarked on in the proof of the consistency of
neighbor-joining in [3]. Furthermore, for a dissimilarity map $\delta$, $Z_\delta(i,j)$ is precisely the
difference between the balanced minimum length of 6 with respect to the star tree, and 
the length with respect to the tree containing the cherry {i,j) with the remaining taxa 
unresolved (see Figure 1 in [12] and the accompanying discussion). 

Finally, the Z-criterion highlights the connection between neighbor-joining and
quartet methods. Recall that the naive quartet method consists of choosing a quartet for
each four taxa using the four point method (Proposition 6), and then returning the tree
consistent with all the quartets (if such a tree exists). This leads us to

Definition 8. A dissimilarity map $\delta$ is quartet consistent with a tree T if for every
$(ij : kl) \in T$, $w_\delta(ij : kl) > \max(w_\delta(ik : jl), w_\delta(il : jk))$.

By definition, the naive quartet method will reconstruct a tree T from a dissimilarity
map $\delta$ if $\delta$ is quartet consistent with T. We note that this is essentially the same as the
ADDTREE method [25], with the minor difference that ADDTREE always outputs a
tree, albeit potentially the wrong one if $\delta$ is not quartet consistent with T.
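
Definition 8 translates directly into a check over the quartets of T. In the sketch below the tree is represented simply by an explicit list of its quartets, which is a hypothetical encoding chosen for brevity, and the function names are likewise assumptions.

```python
def w(d, i, j, k, l):
    """Neighborliness w(ij : kl) for a dissimilarity map d."""
    g = lambda x, y: d[frozenset((x, y))]
    return (g(i, k) + g(i, l) + g(j, k) + g(j, l)) / 2 - g(i, j) - g(k, l)


def quartet_consistent(d, quartets):
    """Definition 8: each quartet (ij : kl) must strictly beat both alternatives."""
    return all(
        w(d, i, j, k, l) > max(w(d, i, k, j, l), w(d, i, l, j, k))
        for (i, j), (k, l) in quartets
    )


# 5-leaf tree with cherries {a, b} and {d, e}, leaf c in the middle,
# all edges of length 1 (illustrative values).
dist = {"ab": 2, "ac": 3, "ad": 4, "ae": 4, "bc": 3,
        "bd": 4, "be": 4, "cd": 3, "ce": 3, "de": 2}
d_tree = {frozenset(k): v for k, v in dist.items()}
quartets = [(("a", "b"), ("c", "d")), (("a", "b"), ("c", "e")),
            (("a", "b"), ("d", "e")), (("a", "c"), ("d", "e")),
            (("b", "c"), ("d", "e"))]

print(quartet_consistent(d_tree, quartets))
```

A tree metric with positive internal edges is always quartet consistent with its own tree; inflating a single entry, for example the distance between a and b, can break the condition.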

In the next section we will prove the following extension of Proposition 6:

Theorem 9. If $4 \le |X| \le 7$ and $\delta : X \times X \to \mathbb{R}$ is a dissimilarity map that is quartet
consistent with a binary tree T then the neighbor-joining algorithm applied to $\delta$ will
construct a tree with the same topology as T. Furthermore, if $5 \le |X| \le 7$ then there
exists $\epsilon > 0$ such that if $\|\tilde{\delta} - \delta\|_\infty < \epsilon$, neighbor-joining applied to $\tilde{\delta}$ will reconstruct a
tree with the same topology as T.

As we have pointed out, neighbor-joining is equivalent to the naive quartet method
and ADDTREE for $|X| = 4$. Theorem 9 states that neighbor-joining is at least as good
as the naive quartet method for trees with at most 7 taxa, and is in fact robust to small
changes in the metric.




Figure 1. A five leaf tree. 



Example 10. Let T be the 5 leaf tree shown in Figure 1 that corresponds to the tree
metric $\delta_T$, and consider the distorted dissimilarity map $\delta$.

Note that $\delta$ is not quartet consistent with T, because $\delta(a,e) + \delta(b,c) < \delta(a,b) + \delta(c,e)$.
However it is easy to verify that neighbor-joining constructs a tree with the same
topology as T. The example shows that neighbor-joining can construct the correct tree even
when the naive quartet method and ADDTREE fail.

The next example shows that Theorem 9 fails for trees with more than 7 taxa.

Example 11. Let T be the 8 leaf tree shown in Figure 2 and let $\delta_T$ be its corresponding
tree metric.

Figure 2. An eight leaf tree.

Consider the distorted dissimilarity map $\delta$
that is quartet consistent with T. It is easy to see that $Q_\delta(x,y) = -6.24$, while $Q_\delta(a,b) =
Q_\delta(m,n) = -6.04$. Therefore the function $Q_\delta$ is not minimized at one of the cherries of
T, and the neighbor-joining algorithm applied to $\delta$ outputs a tree different from T, in
which x, y form a cherry.

Fortunately, there is a single extra condition which ensures that neighbor-joining
correctly reconstructs a tree. In what follows we say that a leaf x is interior to a quartet
$(ij : kl)$ in a tree T if none of $(ik : xl)$, $(ik : xj)$, $(ix : jl)$, or $(kx : jl)$ are quartets in T.






RADU MIHAESCU, DAN LEVY, AND LIOR PACHTER 






Figure 3. The quartet additivity configuration. 



Definition 12. A dissimilarity map $\delta : X \times X \to \mathbb{R}$ is quartet additive with a tree T
if for every $(ij : kl)$ with x interior to $(ij : kl)$, and y not interior to $(ij : kl)$ such
that $(ij : xy)$ is not a quartet of T, we have $w(kl : xy) > w(ij : xy)$ (see Figure 3).

We conclude with three basic lemmas that are important in the next section. The 
proofs are left as an exercise for the reader. 

Lemma 13. Quartet consistency and additivity are both invariant with respect to the 
shifting operation. 

Lemma 14 (Spectator Lemma). For any $a, b, x, y, t \in X$,

$$2w(ab : xy) = w(tb : xy) + w(at : xy) + w(ab : ty) + w(ab : xt).$$

The Spectator Lemma is used to prove

Lemma 15. For any $a, b, c, i, j, x, y \in X$,

$$3w(ab : xy) - 3w(ac : xy) = w(ab : xc) - w(ac : xb) + w(ab : yc) - w(ac : yb),$$
$$4w(ab : xy) - 4w(ij : xy) = w^*(ab : x : ij) + w^*(ab : y : ij),$$

where $w^*(ab : x : ij) = w(ab : ix) + w(ab : jx) - w(ax : ij) - w(bx : ij)$.

4. A consistency theorem for neighbor-joining

Theorem 16. If $\delta : X \times X \to \mathbb{R}$ is quartet consistent and quartet additive with a tree
T, then neighbor-joining applied to $\delta$ will construct a tree with the same topology as T.

The proof of the theorem consists of two main parts. First, we show that when
the reduction step collapses a pair of taxa (x, y) which form a cherry in T, then the
dissimilarity map given by reducing $\delta$ is quartet consistent and additive with the tree
T' obtained by clipping off the cherry (x, y) and labeling its former root (now a leaf) with
the new taxon thus created in the reduction step. In other words, the node obtained by
collapsing (x, y) in $\delta$ is assigned to the location of the common ancestor of x and y in
the true tree T. Note that this only makes sense when x and y do indeed form a cherry,
as this is the only case in which their common ancestor is well defined. The second part
of the inductive proof is showing that given $\delta$ quartet consistent and quartet additive
with a tree T, the optimal pair for the reduction step is guaranteed to form a cherry in
T. This clearly completes the inductive argument. We proceed with the first step.

Lemma 17. Quartet consistency is maintained when reducing a cherry of the reference
tree. Formally, given a pair of taxa $x, y \in X$ which form a cherry in T, the result of
collapsing (x, y) in $\delta$ is quartet consistent with the topology given by deleting the leaf
edges leading to x and y in T and assigning the new member of the set of taxa to the
new leaf thus formed in T.

Proof: Without loss of generality, we may assume that we are collapsing taxa x, y
into taxon z by using the first type of reduction step. For a set $U \subset X$, we let $T|_U$ be
the topology induced by T on U. Note that for $X_1 = X - \{x\}$ and $X_2 = X - \{y\}$,
the two topologies $T|_{X_1}$ and $T|_{X_2}$ are isomorphic under the map $f : X_1 \to X_2$ defined
by $f(u) = u$ for $u \neq x, y$ and $f(y) = x$. We furthermore notice that $\delta|_{X_i}$ is quartet
consistent with $T|_{X_i}$ for i = 1, 2 since $\delta$ is quartet consistent with T. Therefore $\delta|_{X_i}$ will
be quartet consistent with T' under identifying y or x with z. We note again that this
property holds if and only if (x, y) form a cherry in T.

Now note that the reduced dissimilarity map $\delta'$ on the set $X' = (X - \{x, y\}) \cup \{z\}$ is in
fact a linear combination of the two dissimilarity maps $\delta|_{X_1}$ and $\delta|_{X_2}$ under identifying z
with y in $X_1$ and x in $X_2$ (we define the restriction of the dissimilarity map to a subset
of its domain in the obvious way).

The last piece of the proof involves noticing that the condition of quartet consistency
with respect to a topology T is a set of linear inequalities (defined by T), on the values
$\delta(i,j)$, or "linear in $\delta$" for short. Concretely, this means that any linear combination
with nonnegative coefficients (not all zero) of dissimilarity maps that are quartet
consistent with a topology T will also be quartet consistent with T. As noted above,
both $\delta|_{X_1}$ and $\delta|_{X_2}$ are quartet consistent with T',
while $\delta'$ is such a combination of the two under the obvious isomorphisms (in fact $\delta'$ is
the mean of $\delta|_{X_1}$ and $\delta|_{X_2}$). We conclude that $\delta'$ is quartet consistent with T'. □

Lemma 18. Quartet additivity is maintained when reducing a cherry of the reference
tree. Formally, given a pair of taxa $x, y \in X$ which form a cherry in T, the result of
collapsing (x, y) in $\delta$ is quartet additive with respect to the topology given by deleting the
leaf edges leading to x and y in T and assigning the new member of the set of taxa to
the new leaf thus formed in T.

Proof: We note that quartet additivity is also a property that is linear in $\delta$. The
proof proceeds identically with the previous lemma. □

Proof of Theorem 16. By the above lemmas, it suffices to prove that at any step, the
pair of taxa which maximize the Z-criterion form a cherry. We argue by contradiction.
Let us consider a pair $\delta$, T such that (i, j) is a pair of taxa which maximizes $Z_\delta$, but does
not form a cherry in T. Throughout the proof, and in the remainder of the paper, we
multiply $Z_\delta$ by $\binom{n-2}{2}$ to simplify the formulas.

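Case 1: Suppose i or j is part of a cherry. Without loss of generality, assume that
leaf i forms a cherry with leaf $k \neq j$ and let $X' = X - \{i,j,k\}$. Then:

$$Z_\delta(i,k) - Z_\delta(i,j) = \sum_{x \in X'} \left[ w(ik : xj) - w(ij : xk) \right] + \sum_{\{x,y\} \in \binom{X'}{2}} \left[ w(ik : xy) - w(ij : xy) \right].$$

Applying Lemma 15 to the second summand, we have:

$$Z_\delta(i,k) - Z_\delta(i,j) = \sum_{x \in X'} \left[ w(ik : xj) - w(ij : xk) \right] + \frac{1}{3} \sum_{\{x,y\} \in \binom{X'}{2}} \left[ w(ik : xj) - w(ij : xk) + w(ik : yj) - w(ij : yk) \right]$$
$$= \frac{n-1}{3} \sum_{x \in X'} \left[ w(ik : xj) - w(ij : xk) \right].$$

The last step uses the fact that each $x \in X'$ belongs to exactly $n - 4$ pairs of $\binom{X'}{2}$.
Since $(ik : xj)$ is a quartet in T for any x, by consistency $w(ik : xj) - w(ij : xk) > 0$
for all x and therefore $Z_\delta(i,k) - Z_\delta(i,j) > 0$, a contradiction.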






Figure 4. Case 2 in Theorem 16. 

Case 2: Neither i nor j is part of a cherry. Then, we have a situation as in Figure 4.
Let I be the set of leaves on the subtree nearest i along the path from i to j and, similarly,
let J be the set of leaves on the subtree nearest j. Without loss of generality, we assume
that $|I| \le |J|$ and let the pair $a, b \in I$ be a cherry in T. Let $X' = X - \{a,b,i,j\}$. Then

$$Z_\delta(a,b) - Z_\delta(i,j) = \sum_{x \in X'} \left[ w(ab : ix) + w(ab : jx) - w(ax : ij) - w(bx : ij) \right] + \sum_{\{x,y\} \in \binom{X'}{2}} \left[ w(ab : xy) - w(ij : xy) \right].$$

By Lemma 15,

$$Z_\delta(a,b) - Z_\delta(i,j) = \sum_{x \in X'} w^*(ab : x : ij) + \frac{1}{4} \sum_{\{x,y\} \in \binom{X'}{2}} \left[ w^*(ab : x : ij) + w^*(ab : y : ij) \right] = \frac{n-1}{4} \sum_{x \in X'} w^*(ab : x : ij).$$

First note that if $x \in X'$ is a leaf that is not in I or in J (i.e., the paths from i to
x and j to x intersect the path from i to j), then $w(ab : xy) > w(ij : xy)$ for any leaf
y by quartet consistency. Thus we restrict our attention to the leaves in I and J. Let
$I' = I - \{a,b\}$. Then, since $|I| \le |J|$, it follows that $|I'| < |X' - I'|$ and we choose
a subset of J that is the same size as I'. In particular, there exists $I^* \subset J$ such that
$|I^*| = |I'| = p$. Let $I' = \{x_1,\ldots,x_p\}$, $I^* = \{y_1,\ldots,y_p\}$ and $X'' = X' - I' - I^*$. Then:

$$\frac{4}{n-1}\left( Z_\delta(a,b) - Z_\delta(i,j) \right) \ge \sum_{q=1}^{p} \left[ w^*(ab : x_q : ij) + w^*(ab : y_q : ij) \right] + \sum_{x \in X''} w^*(ab : x : ij)$$
$$= 4 \sum_{q=1}^{p} \left[ w(ab : x_q y_q) - w(ij : x_q y_q) \right] + \sum_{x \in X''} \left( \left[ w(ab : ix) - w(ij : ax) \right] + \left[ w(ab : jx) - w(ij : bx) \right] \right).$$

Since $(ab : ix)$, $(ai : jx)$ and $(bi : jx)$ are all quartets in T, by consistency each term
is positive. Therefore $w^*(ab : x : ij) > 0$ and so $Z_\delta(a,b) > Z_\delta(i,j)$, a contradiction. □

Remark 19. Theorem 9 follows from the observation that quartet consistency suffices
in the proof of Theorem 16 when $4 \le |X| \le 7$. Details are omitted.

Corollary 20. Neighbor-joining has $l_\infty$ radius of at least $\frac{1}{2}$. Following Atteson's results
from [1] for the reverse inequality, we conclude that neighbor-joining has $l_\infty$ radius equal
to $\frac{1}{2}$.

Proof: It is easy to see that if $\delta_T$ is a tree metric and $\delta$ is a metric with
$\max_{i,j} |\delta_T(i,j) - \delta(i,j)| < \frac{1}{2} \min_{e \in E(T)} l(e)$, where l(e) is the length of edge e in T, then $\delta$ is quartet
consistent and quartet additive with T. □

The next corollary extends the Visibility Lemma from [6]. A taxon b is visible from a
with respect to a dissimilarity map $\delta$ if

$$b = \operatorname{argmax}_{x \neq a} Z_\delta(a,x).$$

Corollary 21. If $\delta$ is quartet consistent with a tree T and (a, b) is a cherry of T, then
b is visible from a with respect to $\delta$.

Proof: This follows directly from the first step of the proof of Theorem 16. □

The Visibility Lemma is the key to developing a fast neighbor-joining algorithm (FNJ)
that has optimal run time complexity $O(n^2)$ [6]. In fact, we can conclude that

Corollary 22. If $\delta$ is quartet consistent and quartet additive with respect to a tree T
then FNJ will reconstruct T from $\delta$ with run time complexity $O(n^2)$ (which is also the
size of the algorithm input).

5. The edge radius of the neighbor-joining algorithm

In this section we prove a strengthening of the following conjecture about the edge 
radius of the neighbor-joining algorithm: 

Conjecture 23 (Atteson [1]). Let T be a phylogenetic X-tree with associated tree
metric $\delta_T$, and let e be an edge of T of length l(e). If $\delta$ is a dissimilarity map whose
$l_\infty$ distance to $\delta_T$ is less than $\frac{l(e)}{4}$, then neighbor-joining applied to $\delta$ will reconstruct
the edge e correctly, i.e., the tree T' output by neighbor-joining will contain an edge e'
which induces the same split in the tree T' as e induces in T.

Since the $l_\infty$ error bound needed for the neighbor-joining algorithm to
reconstruct an edge correctly is $\frac{1}{4}$ of the length of the edge, we say that the edge radius of
neighbor-joining is $\frac{1}{4}$. This result is optimal. In [1], Atteson presents an example where
the statement fails for $l_\infty$ error larger than $\frac{l(e)}{4}$.

For ease of exposition, in what follows we will drop the requirement that the input 
trees are binary. In other words, we will allow internal nodes of degree higher than three. 
Note that an internal node of degree at least 4 corresponds to one or more internal edges 
of length 0 in a binary tree. We also continue to allow negative leaf edges. Since in
this section we are only concerned with recovering splits corresponding to edges of at
least a certain length in the reference tree, edges of zero length can easily be allowed as our
requirements for reconstructing them become null. Also, since leaf edges are recovered
correctly by design (they correspond to trivial splits), negative leaf edges can easily be
allowed without affecting the analysis. This is also a consequence of Lemma 4.

Definition 24. Let T be a tree and e a non-leaf edge of length l in T, corresponding to
the split $A|B$ of X. We say that a dissimilarity map $\delta$ is $A|B$-consistent with respect to
the tree metric $\delta_T$ if the following conditions hold:

• $\delta(x,y) - \delta_T(x,y) < \frac{l}{4}$ for all pairs $x,y \in A$ and all pairs $x,y \in B$,

• $|\delta(x,y) - \delta_T(x,y)| < \frac{l}{4}$ for all pairs $x \in A$ and $y \in B$.
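
The two conditions of Definition 24 can be checked mechanically. The sketch below assumes that the bound in both conditions is $\frac{l}{4}$, in line with the edge radius discussed above; the function name `split_consistent` and the input encoding are hypothetical. Note that distances within one side of the split are only bounded from above.

```python
def split_consistent(d, d_tree, A, B, length):
    """Check A|B-consistency of d against the tree metric d_tree.

    `length` is the length of the edge inducing the split A|B; the bound
    length / 4 is an assumption of this sketch.
    """
    bound = length / 4
    err = lambda x, y: d[frozenset((x, y))] - d_tree[frozenset((x, y))]
    # Same side of the split: one-sided bound (no lower bound is imposed).
    same_side = all(
        err(x, y) < bound
        for side in (A, B) for x in side for y in side if x != y
    )
    # Across the split: two-sided bound.
    across = all(abs(err(x, y)) < bound for x in A for y in B)
    return same_side and across


# 5-leaf tree with cherries {a, b} and {d, e}, all edges of length 1; the
# internal edge next to the cherry {a, b} induces the split ab | cde.
dist = {"ab": 2, "ac": 3, "ad": 4, "ae": 4, "bc": 3,
        "bd": 4, "be": 4, "cd": 3, "ce": 3, "de": 2}
d_tree = {frozenset(k): v for k, v in dist.items()}

underestimate = dict(d_tree)
underestimate[frozenset("ab")] = 1      # far too small, but on one side of the split
print(split_consistent(underestimate, d_tree, "ab", "cde", 1))
```

The example illustrates the asymmetry: a large underestimate within A does not violate the definition, whereas an error of 0.3 on a pair straddling the split would.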

We are ready to state the main theorem of the section 

Theorem 25. If $\delta$ is an $A|B$-consistent dissimilarity map with respect to the tree T and
the tree metric $\delta_T$, then the neighbor-joining algorithm applied to $\delta$ will output a tree T'
which contains $A|B$ among its edge-induced splits.

This implies more than Atteson's conjecture since we are not imposing a lower bound
on the estimated distances between taxa situated on the same side of the split. This
observation is crucial; we show in Theorem 34 that Atteson's condition as originally
stated in [1] does not hold inductively. As we will see, Lemma 7 in [1] fails if one is
only trying to recover a specific edge in the tree, because non-neighboring pairs may be
collapsed during the agglomeration steps.

The proof of Theorem 25 is based on two propositions. In Proposition 26, we show
that agglomerating a pair of elements in A or a pair of elements in B preserves
$A|B$-consistency. It therefore suffices to show that at every step of the algorithm, the
Z-criterion is maximized for a pair that lies either in A or in B. This is shown in
Proposition 27 by an averaging argument. We will assume, by way of contradiction, that there is
a pair of leaves (i, j) with $i \in A$, $j \in B$ and $Z_\delta(i,j)$ maximal. To obtain a contradiction,
we will assume without loss of generality that $|A| \le |B|$ and show that, on average,
$Z_\delta(a_1,a_2) - Z_\delta(i,j) > 0$ where the average is taken over all pairs $\{a_1,a_2\} \in \binom{A}{2}$.
Consequently, there must be at least one pair of elements $x,y \in A$ such that $Z_\delta(x,y) > Z_\delta(i,j)$.

Proposition 26. Given an $A|B$-consistent dissimilarity map $\delta$, with respect to a tree T,
collapsing a pair of taxa $x,y \in A$ will result in a metric $\delta'$ which is $A'|B$-consistent with
respect to a tree T'. Here A' is the set of taxa obtained by replacing x, y by the collapsed
node z in A.




Figure 5. The collapsing lemma. 

Proof: As before, we assume without loss of generality that we are using the first
variant of the reduction step. Now shift the new dissimilarity map $\delta'$ by $-\frac{\delta_T(x,y)}{2}$ around
the new taxon z. Again, this can be done without affecting the outcome of the algorithm.
In effect, this is equivalent to defining the distances with respect to z in the following
manner:

$$\delta'(z,a) = \frac{1}{2}\left(\delta(x,a) + \delta(y,a) - \delta_T(x,y)\right).$$

Now let e be the edge in T corresponding to the $A|B$ split. Let l be its length. Let
p be the path in T that joins x and y. Then $e \notin p$; let o be the internal point of
T where a path from e reaches p. Consider the new tree T' where the taxa x and y
are removed and the new taxon z is placed exactly at the internal point o. $\delta_{T'}$ and A'
are defined in the obvious way and $\delta'$ is defined by collapsing x, y in $\delta$ according to the
reduction described above.

Let $T_0, \ldots, T_k$ be the subtrees of T hanging off the path p and let $T_0$ be the one that
contains e. We now observe that $\delta'(a,b) = \delta(a,b)$ and $\delta_{T'}(a,b) = \delta_T(a,b)$ for $a,b \neq z$. In
this case the errors remain the same. For $b \in B$, and therefore $b \in T_0$, we observe that

$$\delta_{T'}(z,b) = \delta_T(o,b) = \frac{1}{2}\left(\delta_T(x,b) + \delta_T(y,b) - \delta_T(x,y)\right).$$

Therefore

$$|\delta'(z,b) - \delta_{T'}(b,z)| = \frac{1}{2}\left|(\delta(x,b) - \delta_T(x,b)) + (\delta(y,b) - \delta_T(y,b))\right| < \frac{l}{4}.$$

Now for all i let $o_i$ be the point of the path p that is the root of the subtree $T_i$. Then
for any $a \in A$, so $a \in T_i$ for $i \neq 0$ or $a \in T_0 \cap A$,

$$\delta'(z,a) = \delta_{T'}(a,o_i) + \left[(\delta(x,a) - \delta_T(x,a)) + (\delta(y,a) - \delta_T(y,a))\right]/2$$
$$= \delta_{T'}(a,z) + \left[(\delta(x,a) - \delta_T(x,a)) + (\delta(y,a) - \delta_T(y,a))\right]/2 - \delta_T(o,o_i)$$
$$< \delta_{T'}(a,z) + \frac{l}{4} - \delta_T(o,o_i).$$

Therefore $\delta'$ is $A'|B$-consistent with respect to T' and T' "contains" the edge e. □

Now assume that the pair which maximizes $Z_\delta(i,j)$ is such that $i \in A$, $j \in B$.
Without loss of generality, we assume that $|A| \le |B|$.

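Proposition 27.

$$S(\delta : A, i, j) = \sum_{\{a_1,a_2\} \in \binom{A}{2}} \left[ Z_\delta(a_1,a_2) - Z_\delta(i,j) \right] > 0.$$

That is, there exist $\{x,y\} \in \binom{A}{2}$ such that $Z_\delta(x,y) > Z_\delta(i,j)$.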

We prove the proposition by comparing the differences between the dissimilarity values
and the true tree metric $\delta_T$, with counts of the contested edge in the averaging step. The
calculation is elementary but tedious, and requires a few lemmas and some notation:

Let $M$ be the set of all dissimilarity maps $\delta : X \times X \to \mathbb{R}$ on a set $X$. We represent
a linear function $f : M \to \mathbb{R}$ by

\[ \delta \mapsto \sum_{(a,b) \in \binom{X}{2}} \alpha_f(a,b)\,\delta(a,b). \]

Similarly, if $M_T$ is the set of all tree metrics on a set $X$, we represent a linear function
$f : M_T \to \mathbb{R}$ by

\[ \delta_T \mapsto \sum_{e \in E(T)} \beta_f^T(e)\,l(e). \]

More explicitly, $\beta_f^T(e)$ is the coefficient of $l(e)$ in $f(\delta_T)$, when the metric $\delta_T$ corresponds
to the edge lengths $l$. If $T$ is obvious from the context, we will write $\beta_f^T$ as $\beta_f$.

For a tree $T$, note that an edge $e$ divides the leaves into two sets. For a given tree
leaf $i$, we define $N_e^+(i)$ as the set of leaves on the same side of $e$ as $i$ and $N_e^-(i)$ as the
set of leaves on the far side of $e$ from $i$. By a slight abuse of notation, we will also use
$N_e^+(i)$ and $N_e^-(i)$ to denote the number of elements in their respective sets when such
use does not give rise to confusion.

Lemma 28. Let $a, b$ be leaves in a tree $T$, and let $P_{ab}$ denote the path from $a$ to $b$. Then

\[
\beta_{Z(a,b)}(e) =
\begin{cases}
-(N_e^+(a) - 1)(N_e^-(a) - 1) & \text{if } e \in P_{ab}; \\
N_e^-(a)\,(N_e^-(a) - 1) & \text{otherwise.}
\end{cases}
\]

Proof: For a tree metric $\delta_T$, $w_{\delta_T}(ab : xy)$ is twice the length of the splitting path
if $(ab : xy)$ is a quartet of $T$ and negative the length of the path otherwise. Hence, if
an edge $e$ is on the path from $a$ to $b$, then there is no quartet $(ab : xy)$ such that $e$
is on the splitting path. Hence, the edge $e$ is only counted negatively, once for every
element $x \in N_e^+(a) - \{a\}$ and $y \in N_e^+(b) - \{b\} = N_e^-(a) - \{b\}$. There are exactly
$(N_e^+(a) - 1)(N_e^-(a) - 1)$ such pairs $(x,y)$.

If $e$ is not on the path from $a$ to $b$, then the quartets $(ab : xy)$ with $(x,y) \in \binom{N_e^-(a)}{2}$
are exactly the quartets of $T$ of the form $(ab : \cdot\,\cdot)$ with $e$ on the splitting
path. Consequently, $\beta_{Z(a,b)}(e) = 2\binom{N_e^-(a)}{2} = N_e^-(a)\,(N_e^-(a) - 1)$. □
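The count in Lemma 28 can be checked numerically. The following sketch is an illustration only, not part of the proof: it evaluates $Z(a,b)$ on the split metric of a single edge (distance 1 across the edge, 0 otherwise), which by linearity returns $\beta_{Z(a,b)}(e)$. It assumes the quartet weight $w(ab : xy) = \tfrac{1}{2}[\delta(a,x)+\delta(a,y)+\delta(b,x)+\delta(b,y)] - \delta(a,b) - \delta(x,y)$, which matches the coefficients quoted in the proof of Lemma 30; all function names are hypothetical.

```python
from itertools import combinations


def quartet_weight(d, a, b, x, y):
    # Assumed form of w(ab : xy): coefficient 1/2 on the four cross terms,
    # -1 on d(a, b) and -1 on d(x, y).
    return 0.5 * (d(a, x) + d(a, y) + d(b, x) + d(b, y)) - d(a, b) - d(x, y)


def z_score(d, a, b, leaves):
    # Z(a, b): sum of w(ab : xy) over unordered pairs {x, y} avoiding a and b.
    others = [t for t in leaves if t not in (a, b)]
    return sum(quartet_weight(d, a, b, x, y) for x, y in combinations(others, 2))


def beta_by_evaluation(a, b, side_of_a, leaves):
    # Split metric of one edge e: 1 if e separates the two leaves, else 0.
    def split_metric(u, v):
        return 1.0 if (u in side_of_a) != (v in side_of_a) else 0.0

    return z_score(split_metric, a, b, leaves)


def beta_by_lemma28(a, b, side_of_a, leaves):
    n_plus = len(side_of_a)           # N_e^+(a)
    n_minus = len(leaves) - n_plus    # N_e^-(a)
    if b not in side_of_a:            # e lies on the path from a to b
        return -(n_plus - 1) * (n_minus - 1)
    return n_minus * (n_minus - 1)


leaves = list(range(7))
side = {0, 1, 2}                      # leaves on the same side of e as leaf 0
for other in (1, 3):                  # one choice of b on each side of e
    assert beta_by_evaluation(0, other, side, leaves) == beta_by_lemma28(0, other, side, leaves)
```

Only the split induced by $e$ enters the computation, so no explicit tree needs to be built for this check.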

Lemma 29. Let an edge $e$ define a split $A|B$ and assume $|A| \le |B|$. Then

\[ S(\delta_T : A, i, j) \ge \binom{|A|}{2}(|B| - 1)(n - 1)\,l(e). \]

Proof: This is equivalent to showing that

\[ \beta_{S(\delta_T : A,i,j)}(e) = \binom{|A|}{2}(|B| - 1)(n - 1), \]

and for any other edge $e' \in T$, $\beta_{S(\delta_T)}(e') \ge 0$. Note that since $e$ is never on the path
between $a_1$ and $a_2$ for any pair $a_1, a_2 \in A$, $\beta_{Z(a_1,a_2)}(e)$ is the same for any choice of $a_1, a_2$.
Also, $e$ is on the path between $i$ and $j$, so by application of Lemma 28,

\[
\begin{aligned}
\beta_{S(\delta_T : A,i,j)}(e) &= \sum_{(a_1,a_2) \in \binom{A}{2}} \beta_{Z(a_1,a_2)}(e) - \binom{|A|}{2}\beta_{Z(i,j)}(e) \\
&= \sum_{(a_1,a_2) \in \binom{A}{2}} \bigl[\beta_{Z(a_1,a_2)}(e) - \beta_{Z(i,j)}(e)\bigr] \\
&= \binom{|A|}{2}\bigl[|B|(|B| - 1) + (|A| - 1)(|B| - 1)\bigr] \\
&= \binom{|A|}{2}(|B| - 1)(n - 1).
\end{aligned}
\]

Let $t$ be either vertex of the edge $e$. Assume we have an edge $e' \neq e$ such that $e'$ is in
the subtree spanned by $B$, denoted $e' \in [B]$. Then $e'$ is never on the path between $a_1$
and $a_2$, so $\beta_{Z(a_1,a_2)}(e') = N_{e'}^-(t)\,(N_{e'}^-(t) - 1)$ for any pair $a_1, a_2 \in A$. If $e'$ is on the path
between $i$ and $j$, then $\beta_{Z(i,j)}(e') \le 0$, so clearly $\beta_{S(\delta_T : A,i,j)}(e') \ge 0$. If $e'$ is not on the
path between $i$ and $j$, then $\beta_{Z(i,j)}(e') = \beta_{Z(a_1,a_2)}(e')$, so $\beta_{S(\delta_T)}(e') = 0$. The proof of the
final case that needs to be considered, namely $e' \in [A]$, consists of a trivial, yet slightly
lengthy, counting argument. We only present a brief sketch.

Again, in the case when $e'$ is on the path between $i$ and $j$, $\beta_{Z(i,j)}(e') \le 0$, so
clearly $\beta_{S(\delta_T : A,i,j)}(e') \ge 0$. For $e'$ not on the path from $i$ to $j$, let $A' = N_{e'}^-(i) \subset A$. Let
$a = |A'|$. Then a simple counting argument shows that $\beta_{Z(i,j)}(e') = 2\binom{a}{2}$, whereas for
$f = \sum_{(a_1,a_2) \in \binom{A}{2}} Z_\delta(a_1,a_2)$ we have $\beta_f(e') = 2\binom{a}{2}\binom{|B|}{2}$. Since $|B| \ge |A|$, this concludes
the proof. □

Lemma 30. Let $a, b$ be elements of the leaf set $X$ of size $n$. Then:

\[
\alpha_{Z(a,b)}(x,y) =
\begin{cases}
-\binom{n-2}{2} & \text{if } |\{a,b\} \cap \{x,y\}| = 2; \\
\tfrac{1}{2}(n-3) & \text{if } |\{a,b\} \cap \{x,y\}| = 1; \\
-1 & \text{if } |\{a,b\} \cap \{x,y\}| = 0.
\end{cases}
\]

Proof: The term $\delta(a,b)$ occurs in all $\binom{n-2}{2}$ $w(ab : xy)$ terms in $Z(a,b)$, each time with
a coefficient of $-1$. To compute $\alpha_{Z(a,b)}(a,x)$, we note that $\delta(a,x)$ occurs in all $(n-3)$
terms of the form $w(ab : x\,\cdot)$ with a coefficient of $\tfrac{1}{2}$. The same holds for $\alpha_{Z(a,b)}(b,x)$.
Lastly, $\delta(x,y)$ for $x, y \neq a, b$ occurs only in one term, $w(ab : xy)$, with coefficient $-1$. □
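The three coefficients of Lemma 30 can be recovered by evaluating $Z(a,b)$ on indicator dissimilarity maps. The sketch below is an illustrative check under the assumption that $w(ab : xy)$ carries coefficient $\tfrac{1}{2}$ on the four cross terms and $-1$ on $\delta(a,b)$ and $\delta(x,y)$, as stated in the proof above; the function names are hypothetical and leaves are labelled $0, \ldots, n-1$.

```python
from itertools import combinations
from math import comb


def alpha_by_evaluation(a, b, x, y, n):
    # Evaluate Z(a, b) on the map that is 1 on the pair {x, y} and 0 elsewhere;
    # by linearity the result is the coefficient of delta(x, y) in Z(a, b).
    target = frozenset((x, y))

    def d(u, v):
        return 1.0 if frozenset((u, v)) == target else 0.0

    others = [t for t in range(n) if t not in (a, b)]
    return sum(
        0.5 * (d(a, s) + d(a, t) + d(b, s) + d(b, t)) - d(a, b) - d(s, t)
        for s, t in combinations(others, 2)
    )


def alpha_by_lemma30(a, b, x, y, n):
    shared = len({a, b} & {x, y})
    if shared == 2:
        return -comb(n - 2, 2)
    if shared == 1:
        return 0.5 * (n - 3)
    return -1


n = 7
for x, y in combinations(range(n), 2):
    assert alpha_by_evaluation(0, 1, x, y, n) == alpha_by_lemma30(0, 1, x, y, n)
```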

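Lemma 31. Let $a, a_1 \neq a_2 \in A - \{i\}$ and $b, b_1 \neq b_2 \in B - \{j\}$. We have

(1) $\alpha_{S(\delta)}(i,j) = \tfrac{1}{2}(|A|-1)(|B|-1) + \binom{|A|}{2}\binom{n-2}{2}$,

(2) $\alpha_{S(\delta)}(i,a) = -\binom{|B|}{2} - \tfrac{1}{2}(n-3)\binom{|A|}{2}$,

(3) $\alpha_{S(\delta)}(j,a) = \tfrac{1}{2}(|A|-1)(|B|-1) - \tfrac{1}{2}(n-3)\binom{|A|}{2} = -\tfrac{1}{4}(|A|-1)(n-1)(|A|-2)$,

(4) $\alpha_{S(\delta)}(i,b) = \tfrac{1}{2}(|A|-1)(|B|-1) - \tfrac{1}{2}(n-3)\binom{|A|}{2} = -\tfrac{1}{4}(|A|-1)(n-1)(|A|-2)$,

(5) $\alpha_{S(\delta)}(j,b) = -\binom{|A|}{2} - \tfrac{1}{2}(n-3)\binom{|A|}{2}$,

(6) $\alpha_{S(\delta)}(a,b) = \tfrac{1}{2}(|A|-1)(|B|-1) + \binom{|A|}{2}$,

(7) $\alpha_{S(\delta)}(a_1,a_2) = -\binom{|B|}{2} + \binom{|A|}{2}$,

(8) $\alpha_{S(\delta)}(b_1,b_2) = -\binom{|A|}{2} + \binom{|A|}{2} = 0$.

There are $1$, $|A|-1$, $|A|-1$, $|B|-1$, $|B|-1$, $(|A|-1)(|B|-1)$, $\binom{|A|-1}{2}$ and $\binom{|B|-1}{2}$ of each
of these terms, respectively.

Proof: Let $S_1(\delta) = \sum_{(a_1,a_2) \in \binom{A}{2}} Z_\delta(a_1,a_2)$. Note that

\[
\alpha_{S_1(\delta)}(x,y) =
\begin{cases}
-\binom{|B|}{2} & \text{if } |\{x,y\} \cap A| = 2; \\
\tfrac{1}{2}(|A|-1)(|B|-1) & \text{if } |\{x,y\} \cap A| = 1; \\
-\binom{|A|}{2} & \text{if } |\{x,y\} \cap A| = 0.
\end{cases}
\]

We provide the proof for the first case, where $x, y \in A$ (the other cases are similar). Let
$A' = A - \{x, y\}$, then:

\[
\begin{aligned}
\alpha_{S_1(\delta)}(x,y) &= \alpha_{Z(x,y)}(x,y) + \sum_{a \in A'} \bigl[\alpha_{Z(a,x)}(x,y) + \alpha_{Z(a,y)}(x,y)\bigr] + \sum_{(a_1,a_2) \in \binom{A'}{2}} \alpha_{Z(a_1,a_2)}(x,y) \\
&= -\binom{n-2}{2} + |A'|\bigl[\tfrac{1}{2}(n-3) + \tfrac{1}{2}(n-3)\bigr] + \binom{|A'|}{2}(-1) \\
&= \tfrac{1}{2}(n - |A|)\bigl[|A| - n + 1\bigr] \\
&= -\binom{|B|}{2}.
\end{aligned}
\]

Identities (1)-(8) now follow after some elementary algebra and Lemma 30. □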
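Lemma 32.

\[ S(\delta_T) - S(\delta) < \frac{l}{8}(|A| - 1)(n - 1)(3|A|n - 2n - 2|A|^2 - 4|A| + 4). \]

Proof: Let $\delta$ be an $A|B$-consistent metric and set $\bar\delta = \delta_T - \delta$. Note that for all $x, y$,

\[ \alpha_{S(\delta)}(x,y)\,\bar\delta(x,y) \le \frac{l}{4}\,|\alpha_{S(\delta)}(x,y)|. \]

This follows directly from the fact that $\alpha_{S(\delta)}(x,y) \le 0$ for all cases where $x, y \in A$ or
$x, y \in B$, together with the signs of the terms in Lemma 31 and the definition of
$A|B$-consistency. It follows that

\[ S(\bar\delta) = \sum_{(x,y) \in \binom{X}{2}} \alpha_{S(\delta)}(x,y)\,\bar\delta(x,y) < \frac{l}{4} \sum_{(x,y) \in \binom{X}{2}} |\alpha_{S(\delta)}(x,y)| \]

and the lemma follows by summing the terms appearing in Lemma 31. □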
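The summation behind Lemma 32 is mechanical, so it can be checked by brute force. The sketch below is an illustration under stated assumptions: it takes $S(\delta : A,i,j) = \sum_{(a_1,a_2)} Z_\delta(a_1,a_2) - \binom{|A|}{2} Z_\delta(i,j)$ and the coefficients of Lemma 30, sums $|\alpha_{S(\delta)}(x,y)|$ over all pairs, and compares the total with $\tfrac{1}{2}(|A|-1)(n-1)(3|A|n - 2n - 2|A|^2 - 4|A| + 4)$, which multiplied by $l/4$ gives the bound of the lemma. The function names and the leaf labelling ($A = \{0,\ldots,|A|-1\}$, $i = 0$, $j = |A|$) are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def alpha_z(a, b, x, y, n):
    # Closed form of Lemma 30 for the coefficient of delta(x, y) in Z(a, b).
    shared = len({a, b} & {x, y})
    if shared == 2:
        return Fraction(-comb(n - 2, 2))
    if shared == 1:
        return Fraction(n - 3, 2)
    return Fraction(-1)


def alpha_s(x, y, side_a, i, j, n):
    # Coefficient of delta(x, y) in S(delta : A, i, j) under the assumed definition.
    total = sum(alpha_z(a1, a2, x, y, n) for a1, a2 in combinations(side_a, 2))
    return total - comb(len(side_a), 2) * alpha_z(i, j, x, y, n)


def abs_coefficient_sum(size_a, size_b):
    n = size_a + size_b
    side_a = list(range(size_a))      # A = {0, ..., |A|-1}; the rest is B
    i, j = 0, size_a                  # i in A, j in B
    return sum(abs(alpha_s(x, y, side_a, i, j, n)) for x, y in combinations(range(n), 2))


def lemma32_factor(size_a, size_b):
    n = size_a + size_b
    poly = 3 * size_a * n - 2 * n - 2 * size_a ** 2 - 4 * size_a + 4
    return Fraction((size_a - 1) * (n - 1) * poly, 2)


for size_a in range(2, 5):
    for size_b in range(size_a, 7):   # the lemma assumes |A| <= |B|
        assert abs_coefficient_sum(size_a, size_b) == lemma32_factor(size_a, size_b)
```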
Proof of Proposition 27 and Theorem 25: By Lemmas 29 and 32 it suffices to
show that

\[ \frac{l}{2}|A|(|A| - 1)(|B| - 1)(n - 1) - \frac{l}{8}(|A| - 1)(n - 1)(3|A|n - 2n - 2|A|^2 - 4|A| + 4) > 0. \]

This inequality follows from the fact that we chose $|A| \le |B|$ and consequently $2|A| \le n$.
Thus, the pair of leaves that maximize $Z(\cdot,\cdot)$ are both in $A$. The theorem now follows
from Proposition 26. □
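The final inequality can be made explicit. Writing $a = |A|$, $b = |B|$ and $n = a + b$, the left-hand side divided by $l$ factors as $\tfrac{1}{8}(a-1)(n-1)\bigl[a(b-a) + 2(n-2)\bigr]$, which is positive whenever $2 \le a \le b$. The sketch below checks this factorization with exact rational arithmetic; it is a sanity check rather than part of the argument, and the function names are hypothetical.

```python
from fractions import Fraction


def lemma29_coefficient(a, b):
    # Coefficient of l in the lower bound of Lemma 29: C(a, 2) (b - 1) (n - 1).
    n = a + b
    return Fraction(a * (a - 1) * (b - 1) * (n - 1), 2)


def lemma32_coefficient(a, b):
    # Coefficient of l in the upper bound of Lemma 32.
    n = a + b
    poly = 3 * a * n - 2 * n - 2 * a * a - 4 * a + 4
    return Fraction((a - 1) * (n - 1) * poly, 8)


def factored_margin(a, b):
    # Claimed factorization of the difference of the two coefficients.
    n = a + b
    return Fraction((a - 1) * (n - 1) * (a * (b - a) + 2 * (n - 2)), 8)


for a in range(2, 30):
    for b in range(a, 60):            # |A| <= |B|
        margin = lemma29_coefficient(a, b) - lemma32_coefficient(a, b)
        assert margin == factored_margin(a, b)
        assert margin > 0
```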
We conclude by stating that our analysis holds trivially for the fast neighbor-joining
algorithm of [6]. This follows from the observation that for an $A|B$-consistent dissimilarity
map $\delta$, no pair $x, y$ with $x \in A$ and $y \in B$ can maximize the $Z(\cdot,\cdot)$ criterion, and
therefore the maximizing pair, which has to be visible from both of its members, has
both taxa on the same side of the partition $A|B$.

Corollary 33. If $\delta$ is an $A|B$-consistent dissimilarity map with respect to a tree $T$,
then FNJ applied to $\delta$ will reconstruct a tree $T'$ which contains $A|B$ among its set of
edge-induced splits.

We conclude with a final comment on $A|B$-consistency and our proof of Theorem 25.

Theorem 34. Let $\delta_T$ be a tree metric and $\delta$ a dissimilarity map whose $l_\infty$ distance
to $\delta_T$ is less than $\beta/4$ where $\beta$ is the length of some edge in $T$. Then it may be that
an intermediate tree produced during the agglomeration steps of the neighbor-joining
algorithm has $l_\infty$ distance greater than $\beta/4$ to any tree metric.

Proof: Consider the phylogenetic tree $T$ in Figure 5, with leaf set $S = X \cup \{i,j,a,b\}$
where $|X| = n$. Suppose that the leaf edges corresponding to $i, j, a, b$ all have length
$\alpha$. The lengths of the other visible edges, i.e. the ones not belonging to $T|_X$, are as in
Figure 5. Also suppose that the edges on the subtree $T|_X$ have total length $< \epsilon$, where
we will let $\epsilon$ become arbitrarily small.




Figure 6. Example for Theorem 34. 

Consider the following dissimilarity map $\delta$ with $\|\delta - \delta_T\|_\infty = 1$:

(1) $\delta(i,j) = \delta_T(i,j) - 1$,

(2) $\delta(p) = \delta_T(p) + 1$ for $p = (a,b), (a,i), (b,j)$,

(3) $\delta(p) = \delta_T(p)$ for $p = (a,j), (b,i)$,

(4) $\delta(x,y) = \delta_T(x,y)$ for $x, y \in X$,

(5) $\delta(i,x) = \delta_T(i,x) + 1$ for $x \in X$,

(6) $\delta(j,x) = \delta_T(j,x) + 1$ for $x \in X$,

(7) $\delta(a,x) = \delta_T(a,x) - 1$ for $x \in X$,

(8) $\delta(b,x) = \delta_T(b,x) - 1$ for $x \in X$.

Minimizing the neighbor-joining Q criterion is equivalent to maximizing

\[ Y(k,l) = 2\delta(k,l) + \sum_{t \in S - \{k,l\}} \delta_{k,l}(t), \]

where $\delta_{k,l}(t) = \delta(k,t) + \delta(l,t) - \delta(k,l)$.
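The equivalence between the two criteria is a one-line rearrangement, which the following sketch confirms numerically. It assumes the standard form of the neighbor-joining criterion, $Q(k,l) = (N-2)\,\delta(k,l) - \sum_t \delta(k,t) - \sum_t \delta(l,t)$ with $N$ the number of leaves, under which $Y(k,l) = -Q(k,l)$; the function names are hypothetical and the random integer matrix is only test data.

```python
import random
from itertools import combinations


def q_criterion(d, k, l):
    # Standard neighbor-joining Q value; d is symmetric with zero diagonal.
    n = len(d)
    return (n - 2) * d[k][l] - sum(d[k]) - sum(d[l])


def y_criterion(d, k, l):
    # Y(k, l) = 2 d(k, l) + sum over t not in {k, l} of d(k,t) + d(l,t) - d(k,l).
    rest = [t for t in range(len(d)) if t not in (k, l)]
    return 2 * d[k][l] + sum(d[k][t] + d[l][t] - d[k][l] for t in rest)


def random_dissimilarity(n, seed=0):
    # Symmetric integer matrix with zero diagonal, so comparisons are exact.
    rng = random.Random(seed)
    d = [[0] * n for _ in range(n)]
    for p, q in combinations(range(n), 2):
        d[p][q] = d[q][p] = rng.randint(1, 50)
    return d


d = random_dissimilarity(8)
for k, l in combinations(range(8), 2):
    assert y_criterion(d, k, l) == -q_criterion(d, k, l)
```

Since $Y = -Q$ for every pair, the pair minimizing $Q$ is the pair maximizing $Y$.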

First we show that for large enough $n$, small enough $\epsilon$ and with $\beta > 4$, the pair that
maximizes $Y$ is $(i,j)$. Note that for $x, y \in X$, $Y(x,y) = \sum_{t \in \{i,j,a,b\}} \delta_{x,y}(t) + O(\epsilon)$, which
converges to a constant as $\epsilon \to 0$. However, as $n \to \infty$, $Y(i,j) \approx 2n\beta$. Therefore for
small enough $\epsilon$ and large enough $n$, $(i,j)$ will dominate any pair $(x,y) \in \binom{X}{2}$. Since
$\beta > 4$ and $\|\delta - \delta_T\|_\infty = 1$, using a similar argument as above we can also conclude that
the optimum pair must consist of two leaves from $\{i,j,a,b\}$.

Finally, we need to show that $Y(i,j) > Y(k,l)$ where either $k$ or $l$ is equal to $a$ or $b$.
Note that for any $k, l \in \{i,j,a,b\}$,

\[ Y(k,l) = 2\delta(k,l) + \sum_{t \in \{i,j,a,b\} - \{k,l\}} \delta_{k,l}(t) + \sum_{t \in X} \delta_{k,l}(t). \]

The first and second summands are constants in $n$, while the third is composed of $n$ subterms
which are roughly equal, up to small variations of size at most $O(\epsilon)$, depending
on the location of $t$ in $T|_X$. Since we can choose $\epsilon$ arbitrarily small, we can therefore
ignore this error. Now place a fictitious node $v$ at the root of $T|_X$ (the right hand side
of the edge of length $\beta$). Then letting $n \to \infty$, we see that asymptotically,

\[ Y(k,l) \approx n\,\delta_{k,l}(v). \]

Here we extend the definition of $\delta$ to $v$ by extending $\delta_T$ to $v$ in the natural way and
defining the error $\delta - \delta_T$ for pairs involving $v$ in the same way as for other leaves in $X$.

It is now easy to verify that

(1) $\delta_{i,j}(v) = \delta(i,v) + \delta(j,v) - \delta(i,j) = 2(\alpha + \beta + 1.75) + 1 + 1 - (2\alpha + 3.5 - 1) = 2\beta + 3$,

(2) $\delta_{i,a}(v) = \delta_{j,b}(v) = \delta(i,v) + \delta(a,v) - \delta(i,a) = 2(\alpha + \beta + 1.75) + 1 - 1 - (2\alpha + 1) = 2\beta + 2.5$,

(3) $\delta_{j,a}(v) = \delta_{i,b}(v) = \delta(j,v) + \delta(a,v) - \delta(j,a) = 2(\alpha + \beta + 1.75) + 1 - 1 - (2\alpha + 3.5) = 2\beta$,

(4) $\delta_{a,b}(v) = \delta(a,v) + \delta(b,v) - \delta(a,b) = 2(\alpha + \beta + 1.75) - 1 - 1 - (2\alpha + 3.5 + 1) = 2\beta - 3$.

Therefore asymptotically, $(i,j)$ will be collapsed in the first step of neighbor-joining
applied to $\delta$.
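The four values above are easy to reproduce numerically. The sketch below is an illustration only: it encodes the tree distances that can be read off from the displayed computations (for instance $\delta_T(i,a) = 2\alpha$, $\delta_T(i,j) = 2\alpha + 3.5$ and $\delta_T(\cdot,v) = \alpha + \beta + 1.75$), adds the errors listed in the definition of $\delta$, and evaluates $\delta_{k,l}(v)$. The function names and the sample values of $\alpha$ and $\beta$ are hypothetical.

```python
def example_map(alpha, beta):
    # delta on {i, j, a, b, v}, keyed by frozenset pairs; each two-letter string
    # such as "ia" names the pair {i, a}.
    cross = 2 * alpha + 3.5            # tree distance for (i,j), (a,b), (a,j), (b,i)
    to_v = alpha + beta + 1.75         # tree distance from i, j, a or b to v
    tree = {"ia": 2 * alpha, "jb": 2 * alpha,
            "ij": cross, "ab": cross, "aj": cross, "bi": cross,
            "iv": to_v, "jv": to_v, "av": to_v, "bv": to_v}
    # Errors delta - delta_T; v inherits the errors of the leaves in X.
    error = {"ij": -1, "ab": 1, "ia": 1, "jb": 1, "aj": 0, "bi": 0,
             "iv": 1, "jv": 1, "av": -1, "bv": -1}
    return {frozenset(pair): tree[pair] + error[pair] for pair in tree}


def delta_kl(d, k, l, t):
    # delta_{k,l}(t) = delta(k,t) + delta(l,t) - delta(k,l)
    return d[frozenset((k, t))] + d[frozenset((l, t))] - d[frozenset((k, l))]


alpha, beta = 1.0, 5.0
d = example_map(alpha, beta)
assert delta_kl(d, "i", "j", "v") == 2 * beta + 3
assert delta_kl(d, "i", "a", "v") == 2 * beta + 2.5
assert delta_kl(d, "j", "a", "v") == 2 * beta
assert delta_kl(d, "a", "b", "v") == 2 * beta - 3
```

Among the pairs drawn from $\{i,j,a,b\}$, the pair $(i,j)$ has the largest value of $\delta_{k,l}(v)$, in line with the conclusion above.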

Now consider the reduced distance matrix $\delta' : S' \times S' \to \mathbb{R}$, where the leaves $i$ and $j$
are replaced by a new leaf $u$. That is, $S' = S - \{i,j\} \cup \{u\}$. We restrict our attention to
the set $\{a, b, u, x\}$ for an arbitrary leaf $x$ of $X$. Since the total length of $T|_X$ is $O(\epsilon)$,
we can in fact approximate expressions involving $\delta(\cdot,x)$ by $\delta(\cdot,v)$ up to $O(\epsilon)$ error. We do
so for the sake of simplicity.

Simple calculations now give

\[
\begin{aligned}
\delta'(a,u) &= \delta'(b,u) = 2\alpha + 2.25, \\
\delta'(a,x) &= \delta'(b,x) = \alpha + \beta + 0.75, \\
\delta'(a,b) &= \delta(a,b) = 2\alpha + 4.5, \\
\delta'(u,x) &= \alpha + \beta + 2.75,
\end{aligned}
\]

and thus:

\[ \delta'(a,u) + \delta'(x,b) = \delta'(b,u) + \delta'(x,a) = \delta'(x,u) + \delta'(a,b) - 4.25 = \beta + 3\alpha + 3. \]

Now suppose that there is some additive tree metric $\mu$ such that $\|\delta' - \mu\|_\infty < 1$. Then
by adding this to the above equality we obtain that

\[ \mu(x,u) + \mu(a,b) - \mu(b,u) - \mu(x,a) > 4.25 - 4 > 0 \]

and similarly

\[ \mu(x,u) + \mu(a,b) - \mu(a,u) - \mu(x,b) > 4.25 - 4 > 0. \]

This contradicts the four point condition necessary for $\mu$ to be a tree metric. Therefore,
we have given an example of a dissimilarity map $\delta$ and a tree metric $\delta_T$ with an edge of
length $\beta$, such that $\|\delta - \delta_T\|_\infty < \beta/4$, and yet the $l_\infty$ distance of the reduced dissimilarity
map after the first agglomeration step from any tree metric is greater than $\beta/4$. □
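The violation of the four point condition can be quantified. For four leaves, a tree metric has its two largest pairing sums equal, so the difference between the largest and the second largest sum is zero. The sketch below evaluates this gap for $\delta'$ on $\{a,b,u,x\}$; it is an illustration only, using the values stated above with $\delta'(a,b) = \delta(a,b) = \delta_T(a,b) + 1$ taken from item (2) of the definition of $\delta$. The function names and the sample values of $\alpha$ and $\beta$ are hypothetical.

```python
def reduced_map(alpha, beta):
    # delta' on {a, b, u, x}, keyed by frozenset pairs ("au" names the pair {a, u}).
    return {
        frozenset("au"): 2 * alpha + 2.25,
        frozenset("bu"): 2 * alpha + 2.25,
        frozenset("ax"): alpha + beta + 0.75,
        frozenset("bx"): alpha + beta + 0.75,
        frozenset("ab"): (2 * alpha + 3.5) + 1,   # delta_T(a, b) plus the error +1
        frozenset("ux"): alpha + beta + 2.75,
    }


def four_point_gap(d, p, q, r, s):
    # Largest of the three pairing sums minus the second largest.
    # The four point condition requires this gap to be zero.
    def dist(u, v):
        return d[frozenset((u, v))]

    sums = sorted([dist(p, q) + dist(r, s),
                   dist(p, r) + dist(q, s),
                   dist(p, s) + dist(q, r)])
    return sums[2] - sums[1]


gap = four_point_gap(reduced_map(1.0, 5.0), "a", "b", "u", "x")
assert gap == 4.25
```

Changing every entry by less than 1 changes each pairing sum by less than 2, so the gap changes by less than 4 and stays above 0.25; this is the reason no tree metric can lie within $l_\infty$ distance 1 of $\delta'$.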
The significance of Theorem 34 is that it shows that an inductive proof of Atteson's
conjecture is not possible without relaxing the hypothesis. Thus, the partial result of
[30] and the proof of [1] are incorrect. At the same time, Theorem 34 also identifies an
undesirable property of neighbor-joining which is very common among greedy optimization
algorithms. We hope that further investigations in this direction can yield more
robust versions of the algorithm.



6. Simulation results and conclusion 

We performed a series of simulations to test how frequently Theorem 16 explains 
the success of the neighbor-joining algorithm. Trees with 20 taxa were generated by 
agglomerating pairs at random. We generated 35 such trees and set their edge lengths to 
0.1. We then used seq-gen [22] to build 100 alignments with the Jukes Cantor model for 
each of 28 sequence lengths between 100 and 10,000 base pairs. From these we obtained 
dissimilarity maps using the dnadist program from the PHYLIP package [9]. Our Java 
API then computed the w-matrix for each dissimilarity map, tested the consistency 
and additivity conditions against the true tree, checked to see if the dissimilarity map 
satisfied Atteson's criterion (Theorem 2), and computed the neighbor-joining tree. The 
results are summarized in Figure 6.

We note that our conditions of additivity and consistency are satisfied for sequence 
lengths an order of magnitude smaller than required for Atteson's criterion to hold. 
Moreover, even when the additivity and consistency conditions are not satisfied for every 
quartet, they do hold true for upwards of 99% and 94% of quartets respectively, even 
at sequence length 100 (Figure 7). Hence, applying an averaging argument similar to 
the one we employed in the proof of Atteson's edge radius conjecture, we may obtain "on
average" conditions that explain even more of the cases where neighbor-joining succeeds. 







[Plot omitted. Horizontal axis: sequence length (BP), with ticks at 100, 1,000 and 10,000.]



Figure 7. Conditions satisfied as a function of sequence length. 35 trees
with 20 taxa each were simulated 100 times for 28 different sequence
lengths. 100 alignments were generated for each tree. The figure shows,
for each of the 9800000 dissimilarity maps generated from the simulations,
whether neighbor-joining reconstructed the correct tree, and whether
additivity, consistency and Atteson's criterion were satisfied. Note that the
X-axis is logarithmic.